Error Bounds in Reinforcement Learning Policy Evaluation
Abstract
Building on Kearns & Singh’s (2000) rigorous upper bound on the error of temporal difference estimators, we derive the first rigorous error bound for the maximum likelihood policy evaluation method, as well as an error bound for Monte Carlo matrix inversion policy evaluation. We provide the first direct comparison between the error bounds of the maximum likelihood (ML), Monte Carlo matrix inversion (MCMI), and temporal difference (TD) estimation methods for policy evaluation. We use these bounds to confirm the generally held view that the model-based estimation methods, ML and MCMI, are more accurate than the model-free TD method. With our error bounds, we are also able to identify the parameters and conditions that affect each method’s estimation accuracy.
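To make the contrast concrete, the sketch below compares model-free TD(0) with model-based maximum likelihood (ML) policy evaluation on a small synthetic Markov reward process. This is a minimal illustration only: the chain, rewards, step size, and trajectory length are assumptions chosen for demonstration, not the paper's experimental setup, and the MCMI estimator is omitted.

```python
# Minimal sketch (illustrative assumptions, not the paper's setup):
# contrast model-free TD(0) with model-based ML policy evaluation.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # true transitions under the fixed policy
r = rng.uniform(size=n_states)                        # expected one-step rewards
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Generate one long trajectory under the fixed policy.
T, s = 20000, 0
states, rewards = [s], []
for _ in range(T):
    rewards.append(r[states[-1]])
    states.append(rng.choice(n_states, p=P[states[-1]]))

# --- TD(0): model-free, incremental updates along the trajectory ---
v_td, alpha = np.zeros(n_states), 0.05
for t in range(T):
    s, s_next = states[t], states[t + 1]
    v_td[s] += alpha * (rewards[t] + gamma * v_td[s_next] - v_td[s])

# --- ML: estimate the transition model from the same data, then solve the Bellman equation ---
counts = np.zeros((n_states, n_states))
for t in range(T):
    counts[states[t], states[t + 1]] += 1
P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
v_ml = np.linalg.solve(np.eye(n_states) - gamma * P_hat, r)

print("TD(0) max error:", np.max(np.abs(v_td - v_true)))
print("ML    max error:", np.max(np.abs(v_ml - v_true)))
```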
Similar papers
A note on the function approximation error bound for risk-sensitive reinforcement learning
In this paper we obtain several error bounds on function approximation for the policy evaluation algorithm proposed by Basu et al. when the aim is to find the risk-sensitive cost represented using exponential utility. We also give examples where all our bounds achieve the “actual error” whereas the earlier bound given by Basu et al. is much weaker in comparison. We show that this happens due to...
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
In the field of reinforcement learning there has been recent progress towards safety and high-confidence bounds on policy performance. However, to our knowledge, no methods exist for determining high-confidence safety bounds for a given evaluation policy in the inverse reinforcement learning setting—where the true reward function is unknown and only samples of expert behavior are given. We prop...
On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games
The main contribution of this paper consists in extending several non-stationary Reinforcement Learning (RL) algorithms and their theoretical guarantees to the case of γ-discounted zero-sum Markov Games (MGs). As in the case of Markov Decision Processes (MDPs), non-stationary algorithms are shown to exhibit better performance bounds compared to their stationary counterparts. The obtained bounds ...
Finite-sample analysis of least-squares policy iteration
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report a finite-sample analysis for this algorithm. To do so, we first derive a...
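For reference, the LSTD estimator mentioned in this entry can be sketched in a few lines. The code below is a generic illustration of LSTD(0) with linear features on a synthetic chain; the features, chain, and trajectory length are assumptions for demonstration and are not taken from the cited paper.

```python
# Generic LSTD(0) sketch with linear features (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, d = 6, 0.95, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)  # transitions under the fixed policy
r = rng.uniform(size=n_states)                        # expected one-step rewards
Phi = rng.normal(size=(n_states, d))                  # one feature vector per state

# Accumulate LSTD(0) statistics A and b along one long trajectory.
T, s = 50000, 0
A, b = np.zeros((d, d)), np.zeros(d)
for _ in range(T):
    s_next = rng.choice(n_states, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r[s]
    s = s_next

w = np.linalg.solve(A, b)            # LSTD fixed-point weights
v_hat = Phi @ w
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)
print("max |v_hat - v_true|:", np.max(np.abs(v_hat - v_true)))
```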
Dynamic policy programming
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic l∞-norm performance-loss bounds in the presence of approximation/estimation error w...